Visualized Text-to-Image Retrieval

Di Wu*, Yixin Wan*, Kai-Wei Chang (*Equal Contribution)
University of California, Los Angeles

We propose Visualize-then-Retrieve (VisRet), a new paradigm for Text-to-Image (T2I) retrieval that mitigates the limitations of cross-modal similarity alignment in existing multi-modal embeddings.


Methodology

VisRet first projects textual queries into the image modality via T2I generation. Then, it performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features.
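The following is a minimal sketch of this two-stage idea, not the paper's exact implementation: Stage 1 is assumed to produce a query image with a T2I generator (the paper uses GPT-4o-written instructions and GPT-Image-1), and Stage 2 ranks corpus images by image-to-image similarity with CLIP embeddings from Hugging Face transformers. The function names and model checkpoint here are illustrative assumptions.

```python
# Sketch of Visualize-then-Retrieve (illustrative, not the paper's exact code).
# Stage 1: project the text query into the image modality with a T2I generator.
# Stage 2: retrieve by image-to-image similarity using CLIP image embeddings.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def embed_images(images: list[Image.Image]) -> torch.Tensor:
    """Return L2-normalized CLIP image embeddings."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)


def visret_rank(query_image: Image.Image, corpus: list[Image.Image], k: int = 5) -> list[int]:
    """Rank corpus images by cosine similarity to the generated query image.

    `query_image` is assumed to come from Stage 1, i.e., a T2I model
    rendering of the textual query.
    """
    query_emb = embed_images([query_image])         # shape (1, d)
    corpus_emb = embed_images(corpus)               # shape (n, d)
    scores = (query_emb @ corpus_emb.T).squeeze(0)  # cosine similarity per corpus image
    return scores.topk(min(k, len(corpus))).indices.tolist()
```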

Visual-RAG-ME: A Challenging Benchmark

We introduce Visual-RAG-ME, a new benchmark for evaluating T2I retrieval and knowledge-intensive VQA. It contains 50 high-quality queries, each comparing the visual features of two similar organisms.

VisRet Improves Retrieval

Evaluation results across three T2I retrieval benchmarks using different retrieval strategies and retrievers. We use GPT-4o for T2I instruction generation and GPT-Image-1 for T2I generation. The best results in each column within each retriever group are boldfaced. R = Recall. N = NDCG.
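For reference, the table's metrics can be computed as in the sketch below. The binary-relevance assumption and function names are ours, not taken from the paper.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k ranking."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG@k with binary relevance (gain 1 if relevant, else 0)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-indexed, so position 1 gets log2(2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```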

VisRet Improves Downstream VQA

VQA performance comparison using different LVLMs as instruction generators for VisRet and as query rephrasers for the baselines. CLIP is used as the retriever. Boldfaced numbers indicate the best result in each column.
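As a sketch of the downstream step, the retrieved images can be passed to an LVLM reader together with the question. The example below assumes the OpenAI Python SDK with GPT-4o as the reader and base64-encoded image files; the prompt format and model choice are illustrative, not the paper's exact setup.

```python
import base64

from openai import OpenAI

client = OpenAI()


def encode_image(path: str) -> str:
    """Base64-encode an image file for inclusion in a chat message."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def answer_with_retrieved_images(question: str, image_paths: list[str]) -> str:
    """Ask an LVLM the question, grounded on the top-k retrieved images."""
    content = [{"type": "text", "text": question}]
    for path in image_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encode_image(path)}"},
        })
    resp = client.chat.completions.create(
        model="gpt-4o",  # LVLM reader; illustrative choice
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```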

BibTeX

@misc{wu2025visualizedtexttoimageretrieval,
      title={Visualized Text-to-Image Retrieval},
      author={Di Wu and Yixin Wan and Kai-Wei Chang},
      year={2025},
      eprint={2505.20291},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.20291},
}